Instructions:
In Quiz_data.Rdata you will find a data set called data with three variables: Y, X1, and X2. For the following, you should use the original data; no standardization should be applied.
#(Type your code in the space below)
load("Quiz_data.Rdata")
head(data, 15)
# We count the number of observations
nrow(data)
## [1] 100
There are 100 observations in this dataset.
library(ggplot2)
library(GGally)
library(plotly)
library(ggpubr)
#(Type your code in the space below)
Yhist <- ggplot(data, aes(x = Y)) + geom_histogram(bins = 15, fill = "#7255C0")
X1hist <- ggplot(data, aes(x = X1)) + geom_histogram(bins = 15, fill = "#7255C0")
X2hist <- ggplot(data, aes(x = X2)) + geom_histogram(bins = 15, fill = "#7255C0")
# Making panels
ggarrange(Yhist, X1hist, X2hist, nrow = 2, ncol = 2)
As seen in the dataframe above, all variables are of type double. The histogram of \(Y\) shows a right skew, while \(X_1\) and \(X_2\) are approximately normal; \(X_1\) has a heavier left tail, and \(X_2\) looks the most normal.
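The right skew of \(Y\) seen in its histogram can also be checked numerically. As a sketch (run here on simulated right-skewed data, since the quiz data lives in Quiz_data.Rdata), the sample skewness is positive for right-skewed variables:

```r
# Sample skewness: positive values indicate a right (positive) skew.
# Sketch on simulated data, not the quiz data itself.
sample_skewness <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

set.seed(1)
y_sim <- rexp(100)       # exponential draws are right-skewed
sample_skewness(y_sim)   # positive, consistent with a right skew
```

Applying `sample_skewness(data$Y)` to the quiz data would give a positive value if the visual impression of right skew is correct.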
# (Type your code in the space below)
#We could output correlation matrix
# cor(data)
#This plot will show scatterplot and correlation
ggplotly(ggpairs(data, title = "Correlogram of Data"))
From the correlations, we can see that \(Y\) has a moderately strong correlation with both \(X_1\) and \(X_2\), with values \(0.487\) and \(0.399\) respectively. Both relationships are positive and linear, as shown in the scatterplots, which also reveal the stronger linear relationship between \(Y\) and \(X_1\). It should be noted that \(X_1\) and \(X_2\) have a moderately weak, positive correlation with each other, giving no sign of multicollinearity.
# (Type your code in the space below)
# We fit a first-order model
model1 <- lm(Y ~ X1 + X2, data = data)
model1$coefficients
## (Intercept) X1 X2
## 5.9268528 0.7874189 0.4993759
We have two regression coefficients, our partial slopes for \(X_1\) and \(X_2\), which are \(\hat{\beta_1}\) and \(\hat{\beta_2}\) respectively. We also have an intercept \(\hat{\beta_0}\).
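Putting the estimates above together, the fitted first-order model is

\[
\hat{Y} = 5.9269 + 0.7874\,X_1 + 0.4994\,X_2.
\]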
summary(model1)
##
## Call:
## lm(formula = Y ~ X1 + X2, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.1897 -1.0656 -0.3424 0.3960 5.7706
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.9269 0.5246 11.298 < 2e-16 ***
## X1 0.7874 0.1764 4.464 2.17e-05 ***
## X2 0.4994 0.1662 3.005 0.00338 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.659 on 97 degrees of freedom
## Multiple R-squared: 0.3023, Adjusted R-squared: 0.288
## F-statistic: 21.02 on 2 and 97 DF, p-value: 2.609e-08
From our model diagnostics, we can see that at any reasonable \(\alpha\), our regression parameters are individually significant in the t-tests, and the model is significant overall in the F test. Our \(R^2\) and \(R^2_a\) are not too impressive though, at only \(0.3023\) and \(0.288\) respectively. We can see that the variable \(X_2\) has a larger p-value, suggesting it is not as significant as \(X_1\) and the intercept.
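Each t value in the summary is simply the estimate divided by its standard error; for instance, for \(X_1\), \(0.7874 / 0.1764 \approx 4.464\). A quick sketch of this check, with the values copied from the summary output above:

```r
# t statistics are the coefficient estimates divided by their standard errors
# (values copied from summary(model1) above).
estimates  <- c(intercept = 5.9269, X1 = 0.7874, X2 = 0.4994)
std_errors <- c(intercept = 0.5246, X1 = 0.1764, X2 = 0.1662)
round(estimates / std_errors, 3)   # matches the t value column: 11.298, 4.464, 3.005
```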
# We create our new squared variables
data["X1sq"] <- data$X1^2
data["X2sq"] <- data$X2^2
data["X1X2int"] <- data$X1 * data$X2
#Build model 2
model2 <- lm(Y ~ X1 + X2 + X1sq + X2sq + X1X2int, data = data)
# We can find the variance inflation factors by taking the inverse of the correlation matrix and extracting its diagonal
diag(solve(cor(data)))
## Y X1 X2 X1sq X2sq X1X2int
## 4.637475 13.992204 27.535556 3.504896 28.029851 13.501629
There appears to be strong multicollinearity between \(X_2\) and \(X_2^2\), since their VIFs are much larger than 10 (around 28). We can also see multicollinearity involving \(X_1\) and the \(X_1X_2\) interaction term, each with a VIF of around 14, again greater than 10.
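Note that the diagonal above was taken from the inverse correlation matrix of all columns, including \(Y\); VIFs are conventionally computed from the predictors only, via \(VIF_j = 1/(1 - R_j^2)\), where \(R_j^2\) comes from regressing predictor \(j\) on the other predictors. A sketch of the predictor-only computation on simulated collinear data (not the quiz data):

```r
# VIF_j = 1 / (1 - R_j^2); equivalently, the diagonal of the inverse
# correlation matrix of the predictors. Simulated data for illustration.
set.seed(42)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.1)   # nearly collinear with x1
x3 <- rnorm(100)                  # independent of the others
X  <- cbind(x1, x2, x3)
diag(solve(cor(X)))               # large VIFs for x1 and x2, near 1 for x3
```

For the quiz data, the analogous predictor-only call would be `diag(solve(cor(data[, c("X1", "X2", "X1sq", "X2sq", "X1X2int")])))`.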
summary(model2)
##
## Call:
## lm(formula = Y ~ X1 + X2 + X1sq + X2sq + X1X2int, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.20233 -0.60960 -0.07387 0.57877 2.31998
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.25851 0.70035 7.508 3.39e-11 ***
## X1 0.93613 0.33911 2.761 0.00694 **
## X2 0.16454 0.46573 0.353 0.72467
## X1sq 0.99757 0.07668 13.009 < 2e-16 ***
## X2sq 0.06977 0.07475 0.933 0.35304
## X1X2int -0.05632 0.11013 -0.511 0.61031
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9371 on 94 degrees of freedom
## Multiple R-squared: 0.7844, Adjusted R-squared: 0.7729
## F-statistic: 68.38 on 5 and 94 DF, p-value: < 2.2e-16
In the F test for the model, we reject \(H_0\) and conclude that at least one of the regression coefficients is non-zero, just as we did for model 1. However, if we look at the individual t-tests for the regression coefficients, we can see that not all the coefficients are significant, namely \(X_2\), \(X_2^2\), and \(X_1X_2\). Our coefficient of multiple determination is much higher, but we must keep in mind that part of this increase is due to the larger number of parameters. The \(R^2_a\), a better indicator of fit when adding more terms, still gives us promise, with a value of \(0.7729\) for model 2. In short, model 2 is significant overall, as model 1 was, but several of its individual terms are not.
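Collecting the estimates from the summary, the fitted second-order model is

\[
\hat{Y} = 5.2585 + 0.9361\,X_1 + 0.1645\,X_2 + 0.9976\,X_1^2 + 0.0698\,X_2^2 - 0.0563\,X_1 X_2.
\]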
predict(model2, data.frame(X1 = 0, X2 = 0, X1sq = 0, X2sq = 0, X1X2int = 0), interval = "confidence", level = 0.99)
## fit lwr upr
## 1 5.258508 3.41719 7.099826
We are 99% confident that the true mean response when \(X_1 = X_2 = 0\) lies within the interval \([3.41719, 7.099826]\).
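Because every regressor is zero at this point, the mean response is just the intercept, so the interval can be verified by hand as \(\hat{\beta}_0 \pm t_{0.995,\,94}\,\mathrm{SE}(\hat{\beta}_0)\), with the values copied from the model 2 summary:

```r
# 99% CI for the mean response at X1 = X2 = 0: all regressors are 0, so the
# fitted value is the intercept and its standard error applies directly.
b0    <- 5.25851   # intercept estimate from summary(model2)
se_b0 <- 0.70035   # its standard error
df    <- 94        # residual degrees of freedom
b0 + c(-1, 1) * qt(0.995, df) * se_b0   # approximately (3.417, 7.100)
```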
modelRed <- lm(Y ~ X1 + X1sq, data = data)
anova(model2, modelRed)
Since we have an extremely low p-value, less than \(\alpha = 0.01\), we reject \(H_0\) and conclude that at least one of the coefficients involving \(X_2\) is non-zero.
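The statistic behind this partial \(F\) test compares the reduced model against the full model 2:

\[
F = \frac{\left(SSE(R) - SSE(F)\right)/(df_R - df_F)}{SSE(F)/df_F},
\]

where here \(df_R - df_F = 3\), since three terms (\(X_2\), \(X_2^2\), and \(X_1X_2\)) are dropped, and \(df_F = 94\).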
# Let's take out the interaction term and X2, but keep X2 squared
model3 <- lm(Y ~ X1 + X1sq + X2sq, data = data)
summary(model3)
##
## Call:
## lm(formula = Y ~ X1 + X1sq + X2sq, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.20193 -0.61475 -0.09524 0.59624 2.27067
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.52389 0.18877 29.263 < 2e-16 ***
## X1 0.78262 0.09822 7.968 3.32e-12 ***
## X1sq 0.96897 0.06817 14.214 < 2e-16 ***
## X2sq 0.09320 0.01478 6.306 8.75e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9319 on 96 degrees of freedom
## Multiple R-squared: 0.7822, Adjusted R-squared: 0.7754
## F-statistic: 114.9 on 3 and 96 DF, p-value: < 2.2e-16
We know by the partial \(F\) test that at least one of the coefficients involving \(X_2\) is non-zero. By removing the two variables with the highest p-values in their individual t-tests (\(X_2\) and the interaction term), we have found a model with fewer regression coefficients and a higher \(R^2_a\) of \(0.7754\). If we observe the individual t-tests for all the terms, as well as the overall F test, we find statistical significance for all the parameters and for the model.
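The resulting fitted model, with estimates copied from the summary above, is

\[
\hat{Y} = 5.5239 + 0.7826\,X_1 + 0.9690\,X_1^2 + 0.0932\,X_2^2.
\]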